Add interleaving to sgemm and dgemm. Disentangle trmm/symm from gemm. #5573

almayne · 2025-12-17T09:10:07Z

This change adds interleaving to sgemm and dgemm copies and kernels for ARMV8SVE.
This required a degree of disentangling symm and trmm kernels from gemm. It should now be much easier to apply further optimisations to gemm.
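The description does not spell out the packed layout, so the following is only a generic sketch of one common meaning of "interleaving" in a copy (packing) routine, not the code in this PR. It assumes a column-major panel with `k` rows and `n` columns and stores the `n` entries of each row next to each other, so a kernel can read them with unit stride. The name `pack_interleaved` and its parameters are hypothetical and do not come from OpenBLAS.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pack a k x n panel of a column-major matrix so that, for each row p,
 * the n column entries are adjacent in the output buffer:
 *     out[p * n + j] = b[p + j * ldb]
 * Hypothetical helper for illustration only, not an OpenBLAS routine.
 */
static void pack_interleaved(const double *b, size_t ldb,
                             size_t k, size_t n, double *out)
{
    assert(ldb >= k); /* leading dimension must cover the panel rows */
    for (size_t p = 0; p < k; ++p) {
        for (size_t j = 0; j < n; ++j) {
            out[p * n + j] = b[p + j * ldb];
        }
    }
}
```

For a 3 x 2 panel whose columns are (1, 2, 3) and (4, 5, 6), the packed buffer holds the two columns interleaved row by row.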

The addition of interleaving provides a ~1.4% speedup on c7g (V1), with negligible changes on c8g (V2).

Taken over square matrix operations with size 2->2014, stepsize = 1:
Geometric mean for interleave/c7g_dgemm.txt: 0.9859023206257058
Geometric mean for interleave/c7g_sgemm.txt: 0.9887890902680289
Geometric mean for interleave/c8g_dgemm.txt: 0.9970050554316875
Geometric mean for interleave/c8g_sgemm.txt: 0.9948135816755502

We see an increase in the sgemm speedup (~2.4%) on c7g for larger matrix sizes.

Taken over square matrix operations with size 2,000->10,000, stepsize = 1,000:
Geometric mean for 64thread_interleave/c7g_dgemm.txt: 0.9865252964543917
Geometric mean for 64thread_interleave/c7g_sgemm.txt: 0.9762227312411808
Geometric mean for 64thread_interleave/c8g_dgemm.txt: 0.9997186302044462
Geometric mean for 64thread_interleave/c8g_sgemm.txt: 0.9996022927667269

aditew01 · 2025-12-17T13:18:24Z

@martin-frbg @Mousius can you please have a look?

martin-frbg · 2026-01-06T11:26:59Z

Thanks for looking into this - though I'm not immediately convinced that the speedup warrants the added complexity @Mousius ? At least I'd like to put this off until after the 0.3.31 release.

Mousius · 2026-01-06T11:59:50Z

> Thanks for looking into this - though I'm not immediately convinced that the speedup warrants the added complexity @Mousius ? At least I'd like to put this off until after the 0.3.31 release.

The speedup is one thing, the other is enabling GEMM kernels without the need to also implement the other kernels. This would enable us to land existing SME kernels, as previously proposed here: #5011 (comment)

Putting off until after 0.3.31 makes perfect sense to me as it's a relatively high risk change.

almayne · 2026-01-06T13:05:28Z

> Thanks for looking into this - though I'm not immediately convinced that the speedup warrants the added complexity @Mousius ? At least I'd like to put this off until after the 0.3.31 release.

Hi Martin. Thanks for taking a look. I'm happy for this to go in after the release. Do you have a rough estimate for when that might be, so I can share internally?

almayne · 2026-01-16T09:34:05Z

@martin-frbg congrats on getting the release out. Would it be possible to get this merged?

martin-frbg · 2026-01-23T10:08:41Z

I'm probably missing something obvious, but I'm still a bit confused why this needs the addition of "COMM" kernels instead of declaring dedicated ?TRMM and ?SYMM COPY kernels within the existing framework, like the riscv64 folks do in e.g. KERNEL.RISCV64_ZVL128B ?

almayne · 2026-01-23T10:45:09Z

> I'm probably missing something obvious, but I'm still a bit confused why this needs the addition of "COMM" kernels instead of declaring dedicated ?TRMM and ?SYMM COPY kernels within the existing framework, like the riscv64 folks do in e.g. KERNEL.RISCV64_ZVL128B ?

The intent is to capture what is still common between trmm and symm, and reduce duplicate code. I can refactor if preferred? @Mousius let me know if you have opinions either way. No preference on my part.

Mousius · 2026-02-02T18:11:13Z

I agree with @martin-frbg, it'd be better to name them aligned with the algorithms they're influencing.

@martin-frbg we're targeting these types of blocks:

OpenBLAS/driver/level3/trmm_L.c, lines 151 to 161 in 1a9cf8e:

    GEMM_ONCOPY(min_l, min_jj, b + (jjs * ldb) * COMPSIZE, ldb, sb + min_l * (jjs - js) * COMPSIZE);

    STOP_RPCC(outercost);

    START_RPCC();

    TRMM_KERNEL_N(min_i, min_jj, min_l, dp1,
    #ifdef COMPLEX
                  ZERO,
    #endif
                  sa, sb + min_l * (jjs - js) * COMPSIZE, b + (jjs * ldb) * COMPSIZE, ldb, 0);

I think that'd be adding SYMM_ONCOPY/TRMM_ONCOPY to replace GEMM_ONCOPY if we override it, else falling back to the GEMM_ONCOPY?
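The fallback idea in the comment above can be sketched with plain preprocessor defaults. This is an illustration under assumptions, not OpenBLAS code: the `_DEMO` macro names and `gemm_oncopy_demo` are hypothetical stand-ins, and the real build wires kernels through the KERNEL files rather than a single header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the generic GEMM panel copy: copies an
 * m x n column-major block into a contiguous buffer. */
static int gemm_oncopy_demo(size_t m, size_t n,
                            const double *a, size_t lda, double *b)
{
    assert(lda >= m);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            b[j * m + i] = a[j * lda + i];
        }
    }
    return 0;
}

#define GEMM_ONCOPY_DEMO gemm_oncopy_demo

/* A target that needs a dedicated routine would define these before
 * this point; otherwise both fall back to the GEMM copy. */
#ifndef TRMM_ONCOPY_DEMO
#define TRMM_ONCOPY_DEMO GEMM_ONCOPY_DEMO
#endif

#ifndef SYMM_ONCOPY_DEMO
#define SYMM_ONCOPY_DEMO GEMM_ONCOPY_DEMO
#endif
```

With this shape, a driver would call `TRMM_ONCOPY_DEMO(...)` where it currently calls the GEMM copy, and targets that define nothing keep today's behaviour.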

martin-frbg · 2026-02-02T21:21:27Z

Yes, at least that's my current understanding - that mapping ?SYMM_ONCOPY et al in the KERNEL file should do. I haven't tried it though, as my lowly M4 doesn't handle the SVE bits - guess I'd need to look into hijacking our Numpy-sponsored AWS CI job to run on M9g ?

Mousius · 2026-02-03T10:50:28Z

I don't think we currently have ?SYMM_ONCOPY; the RISCV kernels replace the triangle copies along M:

OpenBLAS/kernel/riscv64/KERNEL.RISCV64_ZVL128B, lines 169 to 172 in 1a9cf8e:

    STRMMUNCOPY_M  =  ../generic/trmm_uncopy_$(SGEMM_UNROLL_M).c
    STRMMLNCOPY_M  =  ../generic/trmm_lncopy_$(SGEMM_UNROLL_M).c
    STRMMUTCOPY_M  =  ../generic/trmm_utcopy_$(SGEMM_UNROLL_M).c
    STRMMLTCOPY_M  =  ../generic/trmm_ltcopy_$(SGEMM_UNROLL_M).c

Which was originally done for SVE:

OpenBLAS/kernel/arm64/KERNEL.ARMV8SVE, lines 149 to 152 in 1a9cf8e:

    STRMMUNCOPY_M  =  trmm_uncopy_sve_v1.c
    STRMMLNCOPY_M  =  trmm_lncopy_sve_v1.c
    STRMMUTCOPY_M  =  trmm_utcopy_sve_v1.c
    STRMMLTCOPY_M  =  trmm_ltcopy_sve_v1.c

So we're aware of these ones being overridable, but the GEMM_ONCOPY doesn't have an override as yet.

For a sanity check, I couldn't find any reference to SYMM_ONCOPY:

SYMM_ONCOPY / SYMMONCOPY / SSYMM_ONCOPY / SSYMMONCOPY

almayne and others added 4 commits December 18, 2025 14:42

Add interleaving to sgemm and dgemm. Disentangle trmm and symm from gemm. Co-authored-by: Chris Sidebottom <[email protected]>

f3c78c9

Fixed builds and added missing copyright notices. Also fixed formatting of copyright notices added in last commit.

651578e

Accommodate ex and quad precision builds.

959d3b3

Add new copy functions to ex and quad precision builds.

0a205ee

almayne force-pushed the sgemm_interleave branch from 8316dc1 to 0a205ee December 18, 2025 15:26

almayne added 4 commits December 18, 2025 15:28

Fix CMake build.

570adff

Fix the CMake build: updates for new copy functions.

a9f5a9b

Merge branch 'OpenMathLib:develop' into sgemm_interleave

12c0752

Added to list of contributions.

815b50c

martin-frbg added this to the 0.3.32 milestone Jan 11, 2026
